ABSTRACT



This paper provides a concise summary of 10 years of 
research into defining appropriate statistical models for estimating school 
and teacher effect on student learning and other important educational 
outcomes. It discusses criteria for judging models and presents the formulae 
for the two-stage, two-level student-school hierarchical linear modeling 
(HLM) model that is the model of choice for estimating school effect and the 
two-stage, two-level student-teacher HLM model that is the model of choice 
for estimating teacher effect. Finally, a brief discussion is provided on 
criterion variables and the methods by which they are weighted, predicted, 
and aggregated in the school and teacher effectiveness system. (Contains 1 

The need for instructional improvement in the Dallas Public Schools had been 
thoroughly documented over a period of twenty years. After a period of rapid 
achievement growth in the early and mid 1980's, student achievement in the Dallas 
schools had leveled out. In 1990, responding to this need, the District's Board of 
Education appointed a citizen's task force, the Commission on Educational Excellence, to 
formulate recommendations to accelerate the needed improvement. After a year of 
community hearings and extensive study, the Commission recommended a six-point plan 
for massive educational reform. At the heart of the Commission's recommendations was 
an accountability system that fairly and accurately evaluated schools and teachers on their 
contributions to accelerating student growth in a number of important and valued 
outcomes of schooling. This was coupled with a movement to give schools more decision-making 
authority about personnel, curriculum, and most other aspects of schooling. In 
exchange for this authority, school staffs were to be held accountable for their actions. 
As part of this recommendation, $2.4 million was set aside as an incentive award to 
reward effective schools, and their professional and support staffs. 



It then became the task of the District's Research, Planning, and Evaluation 
Department to develop, pilot test, and implement an evaluation system to accomplish the 
goals of the Commission. The first step in accomplishing this task was the appointment 
of an Accountability Task Force to oversee the process. This task force, consisting of 
teachers, principals, parents, members of the business community, and central office 
administrators, was charged with the responsibility of advising the General 
Superintendent concerning the implementation of a performance incentive plan, working 
with the administration to ensure the validity of the selection procedure and subsequent 
results of the incentive plan, and serving as a review committee to examine any issues 
raised by personnel concerning questions of equity and fairness of the procedures. 
During a year of exhaustive deliberations, a number of requirements for the methodology 
associated with this plan were developed. Among these were: 



1. It must be value-added. 

2. It must include multiple outcome variables.¹ 

3. Schools must only be held accountable for students who have been exposed to 
their instructional program (continuously enrolled students). 

4. It must be fair. Schools must derive no particular advantage by starting with 
high-scoring or low-scoring students, minority or white students, high or low 
socioeconomic level students, or limited English proficient or non-limited 
English proficient students. In addition, such factors as student mobility, school 
overcrowding, and staffing patterns over which the schools have no control 
must be taken into consideration. 

5. It must be based on cohorts of students, not cross-sectional data. 

¹ Performance indicators for 1997-1998 include Iowa Tests of Basic Skills and Tests of Achievement and 
Proficiency reading and mathematics, grades 1-9; Spanish Assessment of Basic Education, grades 1-6; Texas 
Assessment of Academic Skills, reading and mathematics, grades 3-8 and 10; writing, grades 4, 8, and 10; science 
and social studies, grades 4 and 8; Texas Assessment of Academic Skills, Spanish version grades 3-6; 143 
standardized final examinations in language, mathematics, social studies, science, ESOL, reading, and world 
languages, grades 9-12; promotion rate, grades 1-8; student attendance, grades 1-12; graduation rate, grades 9-12; 
Scholastic Aptitude Test percent tested and scores, grades 9-12; dropout rate, grades 7-12; student enrollment in 
prehonors/honors courses, grades 7-12; student enrollment in advanced diploma plans, grades 9-12; students 
enrolled in advanced placement courses, grades 11-12; Preliminary Scholastic Aptitude Test percent tested and 
scores, grades 9-12; and percent passing Advanced Placement Exams, grades 11-12. The system is operated with 
only continuously enrolled students and includes staff attendance incentives, minimum percent eligible tested 
requirements, and requirements that at least one-half of a school's cohorts must outgrow the national norm group 
on the ITBS and TAP in reading and mathematics. 

Within the five aforementioned parameters, a number of statistical models are 
possible. This study examines alternative methodologies for determining school effect 
and then extends these studies to the determination of teacher effect. These models are 
designed to isolate the effect of a school's or teacher's practices on important student 
outcomes. The school effect can be conceptualized as the difference between a given 
student's performance in a particular school and the performance that would have been 
expected if that student had attended a school with similar context but with practice of 
average effectiveness. The teacher effect can be conceptualized similarly at the teacher 
level. 



Background 



Interest in performance-based or outcomes-based teacher evaluation dates all the 
way back to fifteenth-century Italy where a teacher master's salary was dependent upon 
his or her students' performance. Despite long-term interest, progress in actually linking 
student outcomes to school and teacher performance has been very limited. 

State Departments of Education have taken a leadership role in attempting to 
accomplish this at the school and district level. Forty-six of fifty states have 
accountability systems that feature some type of assessment. Twenty-seven of these 
systems feature reports at the school, district, and state level, three feature school level 
reports only, six feature reports at both the school and district level, seven feature reports 
at the district and state level, two feature reports at the state level only, and one is 
currently under development (Council of Chief State School Officers, 1995). When one 
reviews these systems, it is obvious that their designers are not familiar with the literature 
on value-added systems since only two states, South Carolina (May, 1990) and Tennessee 
(Sanders and Horn, 1995), have used appropriate value-added statistical methodology in 
implementing such systems. Most of the rest tend to evaluate students, not schools or 
districts, and generally cause more harm than good with systematic misinformation about 
the contributions of schools and districts to student academic accomplishments. By 
comparing schools on the basis of unadjusted student outcomes, state reports are often 
systematically biased against schools with population demographics that differ from the 
norm, a fact that was graphically illustrated by Jaeger (1992). In attempting to eliminate 
this bias, a number of states have gone to non-statistical grouping techniques, an 
approach that has serious limitations when there is consistent one-directional variance on 
the grouping characteristics within groups. 

Investigators throughout the world have conducted and reported numerous studies 
aimed at identifying effective schools as well as estimating the magnitude and stability of 
school contributions to student outcomes. Good and Brophy (1986) provide an excellent 
review of this work. Researchers have been working for a number of years on 
appropriate methodology for adjusting for the effects of student and school demographic 
variables in estimating school effects. One approach has been to regress school mean 
outcome measures on school means of one or more background variables. This approach 
is only adequate to the extent that there is not much within school variance, that is, the 
school impacts all students similarly. Mendro and Webster (1993) demonstrated that this 
is generally not the case and that using school level models to attempt to estimate school 
effects, while better than the common practice of reporting unadjusted test scores, 
produces extremely unstable estimates of school effects. 

Another approach, one that has received generally widespread acceptance among 
educational researchers, involves the aggregation of residuals from student-level 
regression models (Aiken and West, 1991; Bano, 1985; Felter and Carlson, 1985; Kirst, 
1986; Klitgard and Hall, 1973; McKenzie, 1983; Millman, 1981; Saka, 1984; Webster 
and Olson, 1988; Webster, Mendro, and Almaguer, 1994). These techniques can 
incorporate a large number of input, process, and outcome variables into an equation and 
determine the average deviation from the predicted student outcome values for each 
school. Schools are then ranked on the average deviation. Some advantages of multiple 
regression analysis over other statistical techniques for this application include its relative 
simplicity of application and interpretation, its robustness, and the fact that general 
methods of structuring complex regression equations to include combinations of 
categorical and continuous variables and their interactions are relatively straightforward 
(Aiken and West, 1991; Cohen, 1968; Cohen and Cohen, 1975; Darlington, 1990). 
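
The residual-aggregation approach described above can be sketched in a few lines. This is a minimal illustration on synthetic data, not the District's production model: the function name, the single pretest predictor, and the simulated school effects are all assumptions made for the example.

```python
import numpy as np

def school_effects_from_residuals(predictors, outcome, school_ids):
    """Fit one student-level OLS model, then average the residuals by school."""
    predictors = np.asarray(predictors, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    school_ids = np.asarray(school_ids)
    # Design matrix: intercept column plus the student-level predictors.
    design = np.column_stack([np.ones(len(outcome)), predictors])
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    residuals = outcome - design @ coef
    # A school's index is the mean deviation of its students from prediction.
    return {s: float(residuals[school_ids == s].mean())
            for s in np.unique(school_ids).tolist()}

# Synthetic data (hypothetical): three schools with built-in effects -2, 0, +2.
rng = np.random.default_rng(0)
school_ids = np.repeat([0, 1, 2], 200)
true_effect = np.array([-2.0, 0.0, 2.0])
pretest = rng.normal(50.0, 10.0, size=600)
posttest = (5.0 + 0.9 * pretest + true_effect[school_ids]
            + rng.normal(0.0, 3.0, size=600))

effects = school_effects_from_residuals(pretest, posttest, school_ids)
ranking = sorted(effects, key=effects.get, reverse=True)  # best school first
```

Because the regression is fit at the student level, within-school variance is retained until the final averaging step, which is the property that distinguishes this approach from regressions on school means.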

Finally, hierarchical linear modeling (HLM) provides estimates of linear equations 
that explain outcomes for group members as a function of the characteristics of the group 
as well as the characteristics of the members. Because HLM involves the prediction of 
outcomes of members who are nested within groups which in turn may be nested in larger 
groups, the technique should be well suited for use in education. The nested structure of 
students within classrooms and classrooms within schools produces a different variance 
at each level for factors measured at that level. Bryk et al. (1988) cited four advantages 
of HLM over regular linear models. First, it can explain achievement and growth as a 
function of school level or classroom level characteristics while taking into account the 
variance of student outcomes within schools or classrooms. Second, it can model the 
effects of student characteristics, such as gender, race/ethnicity, or socioeconomic status, 
on achievement within schools or classrooms and then explain differences in these effects 
between schools or classrooms using school or classroom characteristics. Third, it can 
model the between and within group variance at the same time and thus produce more 
accurate estimates of student outcomes. Finally, it can produce better estimates of the 
predictors of student outcomes within schools and classrooms by using information about 
these relationships from other schools and classrooms. HLM models are discussed in the 
literature under a number of different titles by different authors from a number of diverse 
disciplines (Bryk and Raudenbush, 1992; Dempster, Rubin and Tsutakawa, 1981; Elston 
and Grizzle, 1962; Goldstein, 1987; Henderson, 1984; Laird and Ware, 1982; Longford, 
1987; Mason, Wong, and Entwistle, 1984; Rosenburg, 1973). 

Extending this methodology to the teacher level becomes even more complex. The 
issue really is not one of whether or not student achievement data should be used in 
teacher evaluation, but rather entails a methodological debate over ways to operationalize 
and implement such a system. Unfortunately, the preponderance of literature in the field 
concentrates upon reasons student achievement data cannot be used for teacher evaluation 
rather than upon credible ways to use it. Some of the concerns raised in the literature 
include: 



• the development of procedures to account for the difficulty in measuring the 
long-term development of skills which may not be measured in year-to-year 
growth patterns (TEA, 1988). 

• the assessment of diverse areas of achievement which do not have readily 
available standardized tests is an area of concern when dealing with non-academic 
area teachers. 

• programs which pull out students for remediation, programs which involve 
team-teaching, and programs with extensive use of instructional aides inhibit the 
estimation of an individual teacher's contribution to improved student 
achievement. 

• norm-referenced standardized tests sample broad subject domains and are 
unlikely to match closely the curriculum in particular classrooms at particular 
times (Haertel, 1986). 

• well-established, broadly applicable, and accepted achievement measures are 
not available in all the relevant areas of learning (Bano, 1985). 

• standardized achievement tests are unlikely to reflect the full range of 
instructional goals in their subject areas. Norm-referenced tests tend to ignore 
the higher-order skills. Therefore it is likely that products of superior teaching 
are not measured adequately or completely by standardized achievement tests 
(Bano, 1985). 

• what the student brings to the classroom in terms of ability, home and peer 
influence, motivation and other influences is very powerful in affecting 
academic achievement at the end of the year (Iwanicki, 1986). 

• the statistical methods used to control for non-teacher factors cannot take into 
account all of the relevant factors. More importantly, the methods will be 
incomprehensible to those being evaluated and difficult to defend in public 
(Bano, 1985). 

• non-statistical models for controlling non-teacher factors are easier to explain, 
but cannot take into account most of the necessary circumstances (Bano, 1985). 

• attempting to use any one of a number of regression-based techniques at the 
teacher level creates a rather subtle problem related to the statistical concept of 
degrees of freedom. In general, the number of degrees of freedom upon which 
a statistical procedure is based depends on the sample size (N) and the number 
of sample statistics (i.e., variables in multiple regression). The sample size (i.e., 
number of students) for a teacher is relatively small to start with. However, the 
usable sample size becomes even smaller because development of the regression 
equations requires existing test scores for each student for at least two 
successive years. As an example, a second-grade teacher may have a class of 22 
students, but may only have test scores from the first grade for 11 of those 
students. Since degrees of freedom also depends on the number of variables in 
the multiple regression equation, a regression equation with four (4) variables 
would leave just seven (7) degrees of freedom. The stability of a projected 
regression line is primarily dependent on the number of degrees of freedom. 
Seven is generally not enough for stable estimates. As a general rule of thumb, 
thirty students per variable has been recommended as a minimum number upon 
which to base a projected regression line. 
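
The degrees-of-freedom concern in the last item can be illustrated with a small simulation. This is a hedged sketch on synthetic data: the function name, the normal predictors, and the trial count are assumptions for the example. It compares how much an estimated regression slope varies across repeated samples of 11 students versus 120 students (thirty per variable for four variables). Note that the passage counts 11 - 4 = 7 degrees of freedom; a fit that also estimates an intercept, as below, leaves 6.

```python
import numpy as np

def slope_spread(n_students, n_predictors=4, n_trials=2000, seed=0):
    """Standard deviation of one estimated slope across repeated samples."""
    rng = np.random.default_rng(seed)
    slopes = np.empty(n_trials)
    for t in range(n_trials):
        X = rng.normal(size=(n_students, n_predictors))
        # True model: slope of 1 on the first predictor, unit-variance noise.
        y = X[:, 0] + rng.normal(size=n_students)
        design = np.column_stack([np.ones(n_students), X])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        slopes[t] = coef[1]
    return float(slopes.std())

spread_small_class = slope_spread(11)    # 11 usable students, 4 variables
spread_rule_of_thumb = slope_spread(120) # 30 students per variable
```

In this setup the slope estimated from 11 students varies several times more from sample to sample than the slope estimated from 120, which is the instability the passage warns about.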

Nontechnical concerns most often found in the literature include the concern that 
objectives that are not measured by the tests will be omitted by teachers, that other duties 
such as playground supervision and school committee work may be slighted, and that, 
with each teacher being rated separately, the collegiality necessary to building good 
instructional teams within a building may be damaged. 



Most of the methodological issues raised above can be resolved. (1) Longitudinal 
growth curves, or alternatively, relationships based upon two years of data, can be 
formulated on important outcome variables. In the case of relationships based upon two 
years of data, replication is necessary to assure greater reliability. (2) Criterion- 
referenced tests can be developed and used to assess diverse areas of achievement. (3) In 
cases where there are pull-out or send-in programs, team teaching, or instructional aides, 
data can be provided at the team level rather than at the individual teacher level. (4) 
Measures in addition to norm-referenced tests can be used. (5) Constituents are primarily 
interested in basic skills. To the extent that measures are needed in music, art, physical 
education, etc., they can be developed. (6) Criterion-referenced tests can be used to 
measure higher-order thinking skills. In addition, performance testing can be used as one 
outcome variable with the outcomes being weighted by the reliability of the instruments. 
(7) What the student brings to the classroom in terms of background variables can be 
statistically controlled. These variables typically account for 9-20% of the variance in 
student achievement (Webster, Mendro, and Almaguer, 1993). (8) It has been the 
authors' experience that gender, ethnicity, limited English proficiency status, and 
free-or-reduced-lunch status, plus their interactions, account for most of the variance that can be 
attributed to background variables. They are easy to explain or defend. (9) Non- 
statistical models for controlling non-teacher factors are misleading and should not be 
used (Webster and Edwards, 1993). (10) The degrees of freedom problem is real in that 
one must worry about the stability of the regression line when it is applied to one teacher. 
At the teacher level, replication over several years is the best safeguard against erring 
because of small sample size. 



Criteria For Judging Models 

The traditional criterion for judging the efficacy of regression-based models is 
goodness of fit (r²). This is not a particularly useful tool for judging the effects of 
statistical models that are designed to rank schools or teachers. If one uses the individual 
student as the unit of analysis and applies either OLS regression or HLM to estimating 
effect, the differences in the r² values produced by these models are minute. 

The criterion that we believe should be used in judging the appropriateness of 
statistical models designed to rank schools and teachers is fairness. That is, a school’s or 
teacher’s estimated effectiveness level should not be capable of being predicted by the 
individual or aggregate composition of its student body or classrooms. Variables over 
which the school or teacher has no control should not be correlated with the school’s or 
teacher’s effectiveness rating. These variables include such things as student pretest 
score, ethnicity, economic status, mobility, gender, and limited English proficient status. 
The models that are proposed in this paper produce results at the school level that 
correlate zero with important school and classroom level contextual variables. 
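
The fairness criterion amounts to a simple diagnostic: correlate the effectiveness index with each variable outside the school's control and check that the correlations are near zero. The sketch below is illustrative only. The helper name, the single contextual variable, and the synthetic data are assumptions, and the school-level linear adjustment is used purely to produce an index that passes the check; it is not the model proposed in this paper.

```python
import numpy as np

def fairness_correlations(effect_index, context):
    """Pearson r between an effectiveness index and each contextual variable."""
    index = np.asarray(effect_index, dtype=float)
    return {name: float(np.corrcoef(index, np.asarray(values, dtype=float))[0, 1])
            for name, values in context.items()}

# Synthetic schools (hypothetical): mean scores fall as percent lunch rises.
rng = np.random.default_rng(1)
pct_lunch = rng.uniform(0.0, 1.0, size=50)
unadjusted = 60.0 - 20.0 * pct_lunch + rng.normal(0.0, 3.0, size=50)

# Removing the linear association gives an index uncorrelated with pct_lunch.
slope, intercept = np.polyfit(pct_lunch, unadjusted, 1)
adjusted = unadjusted - (intercept + slope * pct_lunch)

context = {"pct_free_or_reduced_lunch": pct_lunch}
r_unadjusted = fairness_correlations(unadjusted, context)
r_adjusted = fairness_correlations(adjusted, context)
```

An index built from unadjusted scores fails the check with a strongly negative correlation, while the adjusted index correlates essentially zero with the contextual variable.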

It is obviously also very important that there be a school or teacher effect. If there 
is none, one is reduced to ranking schools based on random error. This methodology must 
be part of a comprehensive accountability system that provides valid data for decision- 
making and improvement as well as for accountability (Webster, 1998). 

Relevant Results 



All studies summarized in this section examined correlations between indices 
produced by various methods and those produced by the methodology of choice as well 
as correlations with individual student background and classroom and school contextual 
variables. The methodology of choice for producing school effectiveness estimates is a 
two-stage, two-level student-school HLM model while that for producing estimates of 
teacher effect is a two-stage, two-level student-teacher HLM model. Formulae for both 
of these models are specified later in this paper. Results are discussed in relation to the 
models of choice in order to limit the statistics presented and simplify the discussion. 
Detailed backup data are contained in the various referenced papers. 

School Level 

Individual student background variables included gender, socioeconomic status 
(free or reduced lunch, parental income, family poverty index, parental education level), 
ethnicity, limited English proficient status, and pretest scores. Aggregate school 
variables include percent student mobility, percent overcrowded, percent economically 
disadvantaged (same variables as specified above), percent limited English proficient, 
percent Black, percent Hispanic, and percent minority. The majority of studies were done 
at the grades 4-6 levels although additional studies were conducted at grade 8. 
Conducting studies at the different grade levels is significant because the elementary 
levels (grades 4-6) have large numbers of relatively homogeneous schools while the 
middle school level (grade 8) has a small number of relatively heterogeneous schools. 

Before discussing a number of thoughtful methodologies for estimating school 
effect, it is important to reiterate that ranking schools based on unadjusted student test 
scores or on gain scores is neither particularly informative nor fair. The results produced 
by these systems correlate poorly with the results produced by the model of choice (r’s < 
.508 for unadjusted test scores, r’s < .732 for gain scores) and produce results with 
unacceptably high correlations with student background variables and school aggregate 
variables (as high as r = .648 with parental education level) (Webster, et al., 1995). 
Evidence suggests that this type of reporting under the guise of determining school or 
teacher effect does severe injustice to teachers and schools that serve poor and minority 
student populations. It is important to note that the backbone of most state accountability 
systems is unadjusted test scores or student gain scores, often not even based on cohort 
data. 

Ordinary least squares regression (OLS) models improve reporting significantly 
over those models discussed in the previous paragraph as long as the models use data at 
the individual student level. Using OLS models with aggregated school level variables 
produces results that correlate poorly with results produced by the model of choice 
(r < .58) as well as correlating highly with student and school level contextual variables. 
Too much information is lost when student data are aggregated to the school level prior to 
analysis. The greater the within group variance of the individual schools the poorer the 
estimates produced by the aggregate models (Mendro and Webster, 1993). 

OLS models that include all of the individual student demographic variables 
presented above as well as relevant pretest scores produce results that are moderately 
correlated with the model of choice (r<.8637) and are relatively unbiased at the individual 
student level (most r’s < .02). Correlations with background variables become higher 
than desired at the school level with the correlation with percent Black reaching -.1225. 
If one looks at a grade with relatively few schools (<30) with high within group variance 
such as grade 8 in this series of studies, the school level correlations explode to r>.40 for 
most socioeconomic status indicators (Webster, et al., 1997). 

Implementing a two-stage OLS regression model with pretest and posttest 
variables regressed on student demographic variables in stage one and the resulting 
residuals used in a stage two model predicting posttest from relevant pretests does little to 
improve the equations. Correlations between the indices produced by the one-stage 
versus two-stage models consistently hover around r>.95 and the results from the two-stage 
model correlate slightly higher with the model of choice (r<.8878). However, 
correlations with important student background variables are not improved and the 
correlations with school level contextual variables are actually slightly higher (Webster, 
et al., 1995, 1996, 1997). The reason for the development of the original two-stage OLS 
equations was ease of explanation, not statistical parsimony. 
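
A two-stage OLS procedure of the kind described can be sketched as follows. This is a minimal illustration under stated assumptions: one pretest, one binary demographic indicator, synthetic data, and hypothetical function names. Stage one removes the demographic variable from both pretest and posttest; stage two predicts the posttest residual from the pretest residual, and the remaining residuals are averaged by school.

```python
import numpy as np

def ols_residuals(predictors, outcome):
    """Residuals from an OLS fit of outcome on predictors plus an intercept."""
    design = np.column_stack([np.ones(len(outcome)), predictors])
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return outcome - design @ coef

def two_stage_school_index(demographics, pretest, posttest, school_ids):
    # Stage 1: residualize pretest and posttest on the demographic variables.
    pre_resid = ols_residuals(demographics, pretest)
    post_resid = ols_residuals(demographics, posttest)
    # Stage 2: predict the posttest residual from the pretest residual.
    final_resid = ols_residuals(pre_resid, post_resid)
    index = {s: float(final_resid[school_ids == s].mean())
             for s in np.unique(school_ids).tolist()}
    return index, final_resid

# Synthetic data (hypothetical): three schools with effects -2, 0, +2.
rng = np.random.default_rng(2)
school_ids = np.repeat([0, 1, 2], 200)
true_effect = np.array([-2.0, 0.0, 2.0])
lunch = rng.integers(0, 2, size=600).astype(float)
pretest = 50.0 - 5.0 * lunch + rng.normal(0.0, 10.0, size=600)
posttest = (10.0 + 0.8 * pretest - 3.0 * lunch + true_effect[school_ids]
            + rng.normal(0.0, 3.0, size=600))

index, final_resid = two_stage_school_index(lunch, pretest, posttest, school_ids)
r_with_lunch = float(np.corrcoef(final_resid, lunch)[0, 1])
```

By construction the final student-level residuals are uncorrelated with the stage-one demographic variable, which matches the reported finding that student-level correlations are minimal; nothing in the procedure constrains correlations with school-level aggregates.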

The final series of OLS regression equations examined in this series of studies 
utilized individual student growth curves based on two, three, and four years of data. No 
demographic data were included in the equations since it was believed that the individual 
student growth curves would serve as a surrogate for student background variables. 
These equations produced results that correlated poorly with the other OLS regression 
models (r<.85), more poorly with the model of choice (r<.75), had unacceptably high 
correlations with student ethnicity (r as high as -.3587 with Black students), and 
registered unacceptable correlations with school level contextual variables (r as high as 
.4621 with percentage Black). Thus it seems obvious that individual student growth 
curves do not contain all of the information necessary for non-biased prediction. As an 
aside, the results produced by using four years of prediction correlated .9992 with the 
results produced using three years of prediction and included about 5% more of the 
student population (Webster and Olson, 1988; Webster, et al., 1997). 

Moving to hierarchical linear models (HLM), a number of questions were 
investigated across a series of studies. The first involved whether or not the use of a two 
or three-level HLM model would produce improved effectiveness indices, that is, indices 
that were not correlated with student level or school level contextual variables. It is 
important to note that the correlation between comparable OLS regression models and 
two-stage, two-level HLM models was r>.97. (The correlation of the two-stage OLS 
regression model with the HLM model of choice was only r<.8878 because the HLM 
model of choice had additional student background and school contextual variables.) 
Both models (OLS and HLM) produced minimal correlations with student level 
background variables (r<.01) but the OLS regression model produced correlations with 
school level contextual variables as high as -.1794 while the HLM model caused all of 
those correlations to be zero (Webster, et al., 1995, 1996, 1997). 

The basic three-level HLM model that was designed to include comparable 
student and school-level contextual variables would not run in either a one-stage or two- 
stage form. Although several models were attempted, major problems were encountered 
with the algorithms for solving them. In short, in order to successfully run a three-level 
HLM status model many important contextual variables had to be eliminated from the 
equations, resulting in models that produced unacceptably high correlations with 
non-controlled contextual variables (Webster, et al., 1995). 

We also examined a three-level HLM model using gain scores as the unit of 
analysis instead of pretest and posttest scores. We compared the results obtained from 
comparable two level student-school pretest-posttest HLM models with those obtained 
from three-level gain score-student-school HLM models and got virtually identical results 
(r's >.98). Two-level models are more convenient and efficient than three-level models 
because they can accommodate more level one student and level two school contextual 
variables and they are not nearly as sensitive to multicollinearity and low variance in 
conditioning variables as are three-level models. Whether fixed or random slopes are 
assumed, the number of second and third level conditioning variables is severely limited 
in the three-level model. The inability to accommodate sufficient conditioning variables 
in the three-level HLM gain score model causes results that correlate poorly with the 
model of choice (r<.40) and produces correlations as high as .1887 with important 
student background variables as well as producing results that are highly correlated with 
important school contextual variables that the models are not able to accommodate (r's as 
high as .4747) (Webster, et al., 1997; Weerasinghe, et al., 1997). 

Throughout the course of the various studies several other important issues were 
investigated. The issue of one-stage versus two-stage models, be they OLS regression or 
HLM, is moot. Correlations between and among one-stage versus two-stage models 
consistently hover around r>.95 for comparable models and there generally is no 
practical difference when results are correlated with background variables (Webster, 
et al., 1997). However, the correlations of residuals produced by one-stage HLM models 
with student level contextual variables suggest that one-stage HLM models carry 
suppressor effects that are not found in two-stage HLM models or OLS regression 
models. When this result is coupled with the inability to include important school level 
contextual variables in one-stage HLM models because of limitations of the models, 
resulting in unsatisfactory correlations with those contextual variables, two-stage models 
are the models of choice. 

The final issue investigated was the fixed versus random slopes issue. 
Correlations between and among comparable models assuming fixed versus random 
slopes were generally around r>.98. Models studied were all two-stage HLM models 
since one-stage HLM models including a full array of contextual variables and assuming 
random slopes could not be solved. These models produced low correlations with 
student-level background variables and, when school level conditioning variables were 
added, zero correlations with school level variables. Since the random model controls for 
the effects of possible interactions of concomitant variables in specific school settings, 
and there is no difficulty in solving the two-stage, two-level HLM equations assuming 
random slopes, our models are random slope models. 



Based on the analyses conducted through the series of studies reported in this 
paper, the authors believe that an HLM two-stage, two-level random model with a full 
range of student and school level contextual variables produces the most bias-free 
estimates of school effect. 

Formulae for Estimating School Effect 

Student level variables included in the HLM models are: 

$Y_{ij}$ = Outcome variable of interest for each student i in school j. 
$X_{1ij}$ = Black English Proficient Status (1 if black, 0 otherwise). 
$X_{2ij}$ = Hispanic English Proficient Status (1 if Hispanic, 0 otherwise). 
$X_{3ij}$ = Limited English Proficient Status (1 if LEP, 0 otherwise). 
$X_{4ij}$ = Gender (1 if male, 0 if female). 
$X_{5ij}$ = Free or Reduced Lunch Status (1 if subsidized, 0 otherwise). 
$X_{6ij}$ = Block Average Family Income. 
$X_{7ij}$ = Block Average Family Education. 
$X_{8ij}$ = Block Average Family Poverty Level. 
$X_{kij}$ = Indicates the variable k of student i in school j for i = 1, 2, ..., $I_j$ and j = 1, ... 

School level variables included in the HLM models are: 

$W_{1j}$ = School Mobility. 
$W_{2j}$ = School Overcrowdedness. 
$W_{3j}$ = School Average Family Education. 
$W_{4j}$ = School Average Family Education. 
$W_{5j}$ = School Average Family Poverty Index. 
$W_{6j}$ = School Percentage on Free or Reduced Lunch. 
$W_{7j}$ = School Percentage Minority. 
$W_{8j}$ = School Percentage Black. 
$W_{9j}$ = School Percentage Hispanic. 
$W_{10j}$ = School Percentage Limited English Proficient. 
$W_{11j}$ = School Percentage Instructional Days Lost to Unfilled Vacancies. 

Stage 1: 

$$
\begin{aligned}
Y_{ij} ={}& A_0 + A_1 X_{1ij} + A_2 X_{2ij} + A_3 X_{3ij} + A_4 X_{4ij} + A_5 X_{5ij} + A_6 X_{6ij} + A_7 X_{7ij} + A_8 X_{8ij} \\
&+ A_9 (X_{1ij} X_{4ij}) + A_{10} (X_{2ij} X_{4ij}) + A_{11} (X_{3ij} X_{4ij}) + A_{12} (X_{1ij} X_{5ij}) \\
&+ A_{13} (X_{2ij} X_{5ij}) + A_{14} (X_{3ij} X_{5ij}) + A_{15} (X_{4ij} X_{5ij}) + A_{16} (X_{1ij} X_{4ij} X_{5ij}) \\
&+ A_{17} (X_{2ij} X_{4ij} X_{5ij}) + A_{18} (X_{3ij} X_{4ij} X_{5ij}) + r_{ij}
\end{aligned}
$$

Stage 2: 

Level 1: 

$$
\text{Criterion Variable\_R\_98}_{ij} = \beta_{0j} + \beta_{1j}\,\mathrm{R1\_97}_{ij} + \cdots + \beta_{nj}\,\mathrm{Rn\_97}_{ij} + \delta_{ij}
$$

where 

$\text{Criterion Variable\_R\_98}_{ij}$ = posttest residual from stage one 
$\mathrm{Rn\_97}$ = pretest residual from stage one 
$\delta_{ij} \overset{iid}{\sim} N(0, \sigma^2)$. 

Level 2: 

$$
\beta_{kj} = \gamma_{k0} + \gamma_{k1} W_{1j} + \gamma_{k2} W_{2j} + \cdots + \gamma_{k,11} W_{11j} + u_{kj}
$$

for k = 0, 1, 2, ..., n. 

$E[u_{kj}] = 0$, $\mathrm{Var/Cov}[u_{kj}] = T$, and $u_{kj} \perp \delta_{ij}$. 

$$
\mathrm{SEI}_j = u^{*}_{0j}
$$
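
The Stage 1 equation above can be sketched in a few lines of code. The sketch is illustrative only. It assumes Stage 1 is fitted by ordinary least squares, which the text does not state, and it uses synthetic students in which the three status indicators are treated as mutually exclusive. The function names `stage_one_design` and `stage_one_residuals` are hypothetical.

```python
import numpy as np

def stage_one_design(X):
    """Build the 19-column Stage 1 design matrix from the eight student variables.

    X has shape (n_students, 8); columns 0..7 hold X1..X8 in the order listed above.
    """
    X = np.asarray(X, dtype=float)
    x1, x2, x3, x4, x5 = (X[:, k] for k in range(5))
    interactions = [
        x1 * x4, x2 * x4, x3 * x4,                 # A9..A11: status by gender
        x1 * x5, x2 * x5, x3 * x5,                 # A12..A14: status by lunch status
        x4 * x5,                                   # A15: gender by lunch status
        x1 * x4 * x5, x2 * x4 * x5, x3 * x4 * x5,  # A16..A18: three-way terms
    ]
    return np.column_stack([np.ones(len(X)), X] + interactions)

def stage_one_residuals(y, X):
    """Return r_ij, the part of y not explained by the Stage 1 variables (OLS assumed)."""
    design = stage_one_design(X)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef

# Synthetic students (illustrative only, not district data).
rng = np.random.default_rng(1)
n = 500
group = rng.integers(0, 4, size=n)  # 0 = black, 1 = Hispanic, 2 = LEP, 3 = other
dummies = np.column_stack([(group == g).astype(float) for g in range(3)])  # X1..X3
gender = rng.integers(0, 2, size=n).astype(float)  # X4
lunch = rng.integers(0, 2, size=n).astype(float)   # X5
block = rng.normal(size=(n, 3))                    # X6..X8
X = np.column_stack([dummies, gender, lunch, block])
y = 50 + 3 * lunch - 2 * dummies[:, 2] + block[:, 0] + rng.normal(scale=5, size=n)

design = stage_one_design(X)
residuals = stage_one_residuals(y, X)
```

Applying the same function to the prior-year scores would yield pretest residuals, and applying it to the current-year criterion would yield the posttest residual that enters Stage 2.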



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Teacher Level 



Effectiveness indices at the teacher level are somewhat more complex and require 
great care in interpretation and use. The Dallas Public Schools uses classroom 
effectiveness indices as part of the needs assessment in the teacher evaluation system, not 
as an evaluative tool per se. Teachers are required to formulate strategies to remediate 
problems detected through the classroom effectiveness indices, student skills analyses, 
and other formative data. They are then evaluated on how well they implement those 
strategies. 

Since we are under severe time constraints in the production of classroom 
effectiveness indices, the most parsimonious solution was sought. Working off the 
information that we had obtained in the school effectiveness indices research, we had 
hoped to be able to disaggregate the school data to classroom level, apply an adjustment 
for shrinkage, and produce classroom effectiveness indices. The indices produced in this 
manner correlate very highly with those produced by other legitimate models, but 
produce unacceptably high correlations with classroom level contextual variables 
(Sanders, 1997; Webster et al., 1997). Thus, although it is more time-consuming to 
compute, the model of choice for producing classroom effectiveness indices is a 
two-stage, two-level student-classroom random model HLM.¹ 

¹ Interestingly enough, the model that produces the most bias-free estimates at the classroom level without 
using classroom conditioning variables is a two-stage OLS regression model. Thus, if one is under serious 
time constraints, a two-stage OLS regression model with an adjustment for shrinkage would produce 
results that are very close to the model of choice. 

Formulae for Estimating Teacher Effect 

Student level variables included in the HLM classroom effects models are: 

$Y_{ij}$ = Outcome variable of interest for each student i in classroom j. 
$X_{1ij}$ = Black English Proficient Status (1 if black, 0 otherwise). 
$X_{2ij}$ = Hispanic English Proficient Status (1 if Hispanic, 0 otherwise). 
$X_{3ij}$ = Limited English Proficient Status (1 if LEP, 0 otherwise). 
$X_{4ij}$ = Gender (1 if male, 0 if female). 
$X_{5ij}$ = Free or Reduced Lunch Status (1 if subsidized, 0 otherwise). 
$X_{6ij}$ = Block Average Family Income. 
$X_{7ij}$ = Block Average Family Education. 
$X_{8ij}$ = Block Average Family Poverty Level. 
$X_{kij}$ = Indicates the variable k of student i in classroom j for i = 1, 2, ..., $I_j$ and j = 1, ... 

Classroom level variables included in the HLM models are: 

$T_{1j}$ = Classroom Mobility. 
$T_{2j}$ = Classroom Overcrowdedness. 
$T_{3j}$ = Classroom Average Family Education. 
$T_{4j}$ = Classroom Average Family Education. 
$T_{5j}$ = Classroom Average Family Poverty Index. 
$T_{6j}$ = Classroom Percentage on Free or Reduced Lunch. 
$T_{7j}$ = Classroom Percentage Minority. 
$T_{8j}$ = Classroom Percentage Black. 
$T_{9j}$ = Classroom Percentage Hispanic. 
$T_{10j}$ = Classroom Percentage Limited English Proficient. 

Stage 1: 

$$
\begin{aligned}
Y_{ij} ={}& A_0 + A_1 X_{1ij} + A_2 X_{2ij} + A_3 X_{3ij} + A_4 X_{4ij} + A_5 X_{5ij} + A_6 X_{6ij} + A_7 X_{7ij} + A_8 X_{8ij} \\
&+ A_9 (X_{1ij} X_{4ij}) + A_{10} (X_{2ij} X_{4ij}) + A_{11} (X_{3ij} X_{4ij}) + A_{12} (X_{1ij} X_{5ij}) \\
&+ A_{13} (X_{2ij} X_{5ij}) + A_{14} (X_{3ij} X_{5ij}) + A_{15} (X_{4ij} X_{5ij}) + A_{16} (X_{1ij} X_{4ij} X_{5ij}) \\
&+ A_{17} (X_{2ij} X_{4ij} X_{5ij}) + A_{18} (X_{3ij} X_{4ij} X_{5ij}) + r_{ij}
\end{aligned}
$$

Stage 2: 

Level 1: 

$$
\text{Criterion Variable\_R\_98}_{ij} = \beta_{0j} + \beta_{1j}\,\mathrm{R1\_97}_{ij} + \cdots + \beta_{nj}\,\mathrm{Rn\_97}_{ij} + \delta_{ij}
$$

where 

$\text{Criterion Variable\_R\_98}_{ij}$ = posttest residual from stage one 
$\mathrm{Rn\_97}$ = pretest residual from stage one 
$\delta_{ij} \overset{iid}{\sim} N(0, \sigma^2)$. 

Level 2: 

$$
\beta_{kj} = \gamma_{k0} + \gamma_{k1} T_{1j} + \gamma_{k2} T_{2j} + \cdots + \gamma_{k,10} T_{10j} + u_{kj}
$$

for k = 0, 1, 2, ..., n. 

$E[u_{kj}] = 0$, $\mathrm{Var/Cov}[u_{kj}] = T$, and $u_{kj} \perp \delta_{ij}$. 

$$
\mathrm{CEI}_j = u^{*}_{0j}
$$
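
The adjustment for shrinkage mentioned in this section can be illustrated with a deliberately simplified sketch. It assumes a random-intercept model only, with no pretest slopes and no classroom conditioning variables, and it treats the between-classroom variance `tau2` and the within-classroom variance `sigma2` as known. In the full model these quantities are estimated, so this is not the model of choice described above. The function name `shrunken_index` is hypothetical.

```python
import numpy as np

def shrunken_index(residuals, classroom, tau2, sigma2):
    """Shrink each classroom's mean residual toward zero according to its reliability.

    reliability_j = tau2 / (tau2 + sigma2 / n_j), where n_j is the classroom size.
    """
    residuals = np.asarray(residuals, dtype=float)
    labels, inverse = np.unique(np.asarray(classroom), return_inverse=True)
    counts = np.bincount(inverse)                              # n_j per classroom
    means = np.bincount(inverse, weights=residuals) / counts   # mean residual per classroom
    reliability = tau2 / (tau2 + sigma2 / counts)
    return labels, reliability * means, reliability

# Two synthetic classrooms with the same mean residual but different sizes.
classroom = np.array(["a"] * 2 + ["b"] * 20)
residuals = np.array([4.0, 6.0] + [5.0] * 20)
labels, index, reliability = shrunken_index(residuals, classroom, tau2=1.0, sigma2=4.0)
```

Both classrooms have a mean residual of 5, but the two-student classroom is pulled much closer to zero than the twenty-student classroom, which is the behavior that makes indices for small classes more conservative.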



Outcome Variables and Associated Equations 

Figure 1 shows the nature of the equations used in the generation of school 
effectiveness indices. Each outcome variable is described under "Outcome" along with 
the grades at which it is included, the score that is the basis for the analysis, the 
methodology utilized, the level at which the data are analyzed (student or school level), 
possible predictors and the grades at which they are found, and the school level 
conditioning variables included in the student level equations. Two different regression 
models are used depending on whether the unit of analysis is the student, in which case 
hierarchical linear modeling is used, or the school, in which case multiple regression 
analysis is used. Through these approaches it is possible to obtain extremely reliable 
predictions of student and school outcomes and to compare actual outcomes to those that 
are predicted. All analyses done at the student level are calculated on residuals, 
that is, scores from which the effects of individual student characteristics over which 
the schools have no control (gender, ethnicity, limited English proficient status, 
socioeconomic status, and all of the interactions between those variables) have been removed. 

Classroom effectiveness indices are computed utilizing student and classroom 
data (classroom level conditioning variables) for the Iowa Tests of Basic Skills, the Tests 
of Achievement and Proficiency, the Texas Assessment of Academic Skills, the Texas 
Assessment of Academic Skills-Spanish, the Spanish Assessment of Basic Education, the 
Assessments of Course Performance, and the Woodcock-Muñoz Language Survey. 

Summary 
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This paper has described statistical models for estimating school and teacher 
effect. It has summarized ten years of research on developing these models and has 
specified the models in use in the Dallas Public Schools. The school and classroom 
effectiveness indices described in this paper are part of a comprehensive evaluation 
system that provides data for both decision-making and accountability. That system is 
focused on continuous improvement and includes both absolute measures through the 
School and District Improvement Plans and relative measures through the school and 
classroom effectiveness indices. Webster (1998) provides a comprehensive overview of 
that system. 



Figure 1. Description of Variables and Methodology 
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